What is modeling?
Can mean 3 things:
Error & cost:
Features:
Tables
Structured Data
Unstructured Data
Common types of structured data
Applied to credit loan problem:
m = number of data points
n = number of attributes
x_ij = jth attribute of ith data point
x_i1 = credit score of person i
x_i2 = income of person i
y_i = response for data point i
y_i = 1 if data point i is blue, -1 if data point i is red
line: (a_1 * x_1) + (a_2 * x_2) + ... + (a_n * x_n) + a_0 = 0
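As a hedged sketch (not from the notes): a linear-kernel SVM can find such a separating line in R; kernlab's ksvm is one option, and the data frame credit with columns credit_score, income, and approved below is made up for illustration.
# Sketch: fit a linear classifier of the form a_1*x_1 + a_2*x_2 + a_0 = 0 (assumes kernlab is installed;
# 'credit' and its columns are hypothetical, simulated data)
library(kernlab)
set.seed(1)
credit <- data.frame(credit_score = rnorm(100, 650, 50), income = rnorm(100, 60, 15))
credit$approved <- as.factor(ifelse(0.01 * credit$credit_score + 0.05 * credit$income + rnorm(100, sd = 0.5) > 9.5, 1, -1))
svm_fit <- ksvm(as.matrix(credit[, c("credit_score", "income")]), credit$approved,
                type = "C-svc", kernel = "vanilladot", C = 100, scaled = TRUE)
a <- colSums(svm_fit@xmatrix[[1]] * svm_fit@coef[[1]])   # a_1, a_2 (in scaled units)
a0 <- -svm_fit@b                                          # a_0
a; a0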
Notes:
Scaling linearly:
Scaling to a normal distribution:
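A minimal R sketch of both scaling approaches, using an illustrative numeric vector:
# Illustrative numeric vector
x <- c(300, 450, 620, 710, 850)
# Linear (min-max) scaling to the range [0, 1]
x_linear <- (x - min(x)) / (max(x) - min(x))
# Scaling toward a standard normal distribution (z-scores)
x_standard <- (x - mean(x)) / sd(x)
# scale() does the same z-score calculation
all.equal(as.vector(scale(x)), x_standard)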
Which method to use? Depends:
Idea:
Things to keep in mind:
Data has 2 types of patterns:
"Fitting" matches both:
How to measure a model's performance:
What if we want to compare 5 runs of SVM and KNN?
Problems:
Solution:
How much data to use?
How to split data?
Idea: For each of the k parts - train the model on all the other parts and evaluate it on the one remaining.
Each part is used k-1 times for training, and 1 time for validation. A common choice of k is 10. Evaluate performance across all k splits, then train the model again using all the data.
Definition: Grouping data points
Use:
Distances are measured in n dimensions; the infinity norm is the p-norm where p is set to infinity.
Why use infinity norms?
How it works:
Pick k cluster centers within the range of the data.
Characteristics:
heuristic: Fast and good, but not guaranteed to find the absolute best solution.
We have to choose k as well; the algorithm doesn't pick k for us, so k becomes part of the optimization.
k == # of data points may be the most theoretically optimal, but does that actually make sense for the task? Plot k vs. total distance to find the "elbow" of the curve. At a certain number, the benefit of adding another k becomes negligible.
Classification task: Given a new data point, determine which cluster the new data point belongs to. To do this, simply put it into whichever cluster centroid it is closest to.
Another classification task: What range of possible data points would we assign to each cluster?
Image of cluster region, aka "Voronoi diagram"
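A minimal R sketch of k-means, the elbow approach, and nearest-centroid classification described above; the two-cluster data frame dat is simulated for illustration, not from the notes.
# Illustrative two-cluster data
set.seed(1)
dat <- data.frame(x1 = c(rnorm(30, 0), rnorm(30, 5)), x2 = c(rnorm(30, 0), rnorm(30, 5)))
X <- scale(dat)
km <- kmeans(X, centers = 2, nstart = 25)   # nstart re-runs the heuristic from several starting points
km$cluster    # cluster assignment for each data point
km$centers    # cluster centroids
# Elbow method: total within-cluster distance for k = 1..10
tot_dist <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, tot_dist, type = "b", xlab = "k", ylab = "Total within-cluster sum of squares")
# Classification task: assign a new point to the closest centroid
new_point <- X[1, ]   # illustrative "new" point
which.min(colSums((t(km$centers) - new_point)^2))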
Data Preparation: Quantitative Examples
Things to watch out for before building models:
Definition: Data point that's very different from the rest.
Depends on the data:
Actions:
Definition: Determining whether something has changed.
Why:
Definition: Detect increase/decrease or both by the cumulative sum.
How:
Terms:
Formula to detect increase (True if $S_t \ge T$)
$$ S_t = \max\left\{0,\ S_{t-1} + (X_t - \mu - C) \right\} $$
Formula to detect decrease (True if $S_t \ge T$); $\mu$ and $X_t$ are flipped:
$$ S_t = \max\left\{0,\ S_{t-1} + (\mu - X_t - C) \right\} $$
Note: Both can be used in conjunction to create a control chart, where $S_t$ is plotted over time and if it ever gets beyond the threshold line, it shows that CUSUM detected a change.
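A small R sketch of the increase-detection recursion above; the values of $\mu$, $C$, and $T$ are arbitrary choices for illustration:
# CUSUM detection of an increase: S_t = max(0, S_{t-1} + (x_t - mu - C)), flag when S_t >= T
cusum_increase <- function(x, mu, C, T_threshold) {
  S <- 0
  for (t in seq_along(x)) {
    S <- max(0, S + (x[t] - mu - C))
    if (S >= T_threshold) return(t)   # first time period where a change is detected
  }
  NA                                  # no change detected
}
set.seed(1)
x <- c(rnorm(50, mean = 10), rnorm(50, mean = 12))   # mean shifts upward halfway through
cusum_increase(x, mu = 10, C = 0.5, T_threshold = 5)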
Time series data will have a degree of randomness. Exponential smoothing accounts for this by smoothing the curve.
Example:
Time series complexities
Trends
Exponential Smoothing, but with $T_t$ (trend at time period $t$): $$S_t = \alpha x_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$
Trend at time $t$ based on the delta between the baseline estimates $S_t$ and $S_{t-1}$, with a constant $\beta$: $$T_t = \beta (S_t - S_{t-1}) + (1-\beta)T_{t-1}$$
Cyclic
Two ways to calculate:
Baseline formula (including trend + seasonality)
$$ S_t = \frac{\alpha x_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + T_{t-1}) $$
Update the seasonal (cyclic) factor in a similar way as trends:
$$ C_t = \gamma\left(\frac{x_t}{S_t}\right) + (1 - \gamma)C_{t-L} $$
Example: Sales trends
Starting Conditions
For trend:
For multiplicative seasonality
"Smoothing"
Graph of what it looks like:
"Exponential"
Each time period estimate can be plugged in like this:
Given basic exponential smoothing equation
$$ S_t = \alpha x_t + (1-\alpha)S_{t-1} $$
We want to make a prediction $S_{t+1}$. Since $x_{t+1}$ is unknown, replace it with $S_t$.
Using $S_t$, the forecast for time period $t+1$ is
$$ F_{t+1} = \alpha S_t + (1-\alpha)S_t $$
Hence, our estimate is the same as our latest baseline estimate:
$$ F_{t+1} = S_t $$
Factoring in trend/cycle
The above equation can be extrapolated to trend/cycle calculations.
Best estimate of trend is the most current trend estimate:
$$ F_{t+1} = S_t + T_t $$
Same for cycle (multiplicative seasonality):
$$ F_{t+1} = (S_t + T_t)\, C_{(t+1)-L} $$
where $F_{t+k} = (S_t + kT_t)\, C_{(t+1)-L+(k-1)}$ for $k = 1, 2, \ldots$
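A short R sketch using the built-in HoltWinters function, which estimates $\alpha$, $\beta$, and $\gamma$ and produces the $F_{t+k}$ forecasts above; the monthly series is made up for illustration:
# Triple exponential smoothing (Holt-Winters) with multiplicative seasonality (illustrative series)
set.seed(1)
sales <- ts(100 + 1:48 + 10 * sin(2 * pi * (1:48) / 12) + rnorm(48, sd = 3), frequency = 12)
hw <- HoltWinters(sales, seasonal = "multiplicative")
hw$alpha; hw$beta; hw$gamma     # fitted smoothing parameters
predict(hw, n.ahead = 12)       # F_{t+k} forecasts for the next 12 periods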
3 key parts
1. Differences
For example:
$$ D_{(1)t} = x_t - x_{t-1} $$
$$ D_{(2)t} = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2}) $$
$$ D_{(3)t} = [(x_t - x_{t-1}) - (x_{t-1} - x_{t-2})] - [(x_{t-1} - x_{t-2}) - (x_{t-2} - x_{t-3})] $$
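These differences can be computed in R with the built-in diff function:
# First-, second-, and third-order differences of an illustrative series
x <- c(5, 7, 10, 14, 19, 25)
diff(x, differences = 1)   # D_(1)t = x_t - x_{t-1}
diff(x, differences = 2)   # D_(2)t: differences of the first differences
diff(x, differences = 3)   # D_(3)t: differences of the second differences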
2. Autoregression
Definition: Predicting the current value based on previous time periods' values.
Autoregression's exponential smoothing:
Order-p autoregressive model:
"ARIMA" combines autoregression and differencing
3. Moving Average
ARIMA (p,d,q) model
$$ D_{(d)t} = \mu + \sum_{i=1}^{p}\alpha_i D_{(d)(t-i)} - \sum_{i=1}^{q}\theta_i(\hat{x}_{t-i} - x_{t-i}) $$
Choose:
Other flavors of ARIMA
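A minimal R sketch of fitting an ARIMA(p, d, q) model with the built-in arima function; the series and the order (1, 1, 1) are illustrative:
# Fit an ARIMA(1, 1, 1) model to an illustrative series
set.seed(1)
x <- cumsum(rnorm(200))          # random-walk-like series, so first differencing (d = 1) is sensible
fit <- arima(x, order = c(1, 1, 1))
fit$coef                         # estimated AR (alpha) and MA (theta) coefficients
predict(fit, n.ahead = 5)$pred   # forecasts for the next 5 time periods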
Definition: Estimate or forecast the variance of something, given time-series data.
Motivation:
Mathematical Model:
$$ \sigma_t^2 = \omega + \sum_{i=1}^{p}\beta_i\sigma_{t-i}^2 + \sum_{i=1}^{q} \gamma_i\epsilon_{t-i}^2 $$
What it explains:
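A hedged sketch of estimating this variance model in R, assuming the tseries package is available; the series below is simulated from a GARCH(1,1) process only for illustration:
# GARCH(1, 1) variance estimation (assumes the tseries package)
library(tseries)
set.seed(1)
n <- 500
eps <- numeric(n); sigma2 <- numeric(n)
sigma2[1] <- 0.0001
for (t in 2:n) {
  sigma2[t] <- 0.00005 + 0.85 * sigma2[t - 1] + 0.10 * eps[t - 1]^2   # simulated variance recursion
  eps[t] <- rnorm(1, sd = sqrt(sigma2[t]))
}
g <- garch(eps, order = c(1, 1))   # estimates omega plus one GARCH term and one ARCH term
coef(g)                            # estimated coefficients
summary(g)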
Definition: Linear regression with one predictor
Example:
Sum of Squared Errors $$ \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n\left(y_i - (a_0 + a_1 x_{i1})\right)^2 $$
Best-fit regression line
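A minimal R sketch of finding that best-fit line by least squares; the data frame and column names are hypothetical:
# One-predictor regression: lm() chooses a_0 and a_1 to minimize the sum of squared errors
df <- data.frame(x1 = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 8.1, 9.8))   # hypothetical data
fit <- lm(y ~ x1, data = df)
coef(fit)              # a_0 (intercept) and a_1 (slope)
sum(residuals(fit)^2)  # the minimized sum of squared errors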
How it works:
AIC applied to regression
Equation:
Example:
Relative likelihood $$ e^\frac{(AIC_1 - AIC_2)}{2} $$
Applied to Models 1 & 2: $$ e^\frac{(75 - 80)}{2} = 8.2\% $$
Result:
Characteristics:
BIC Metrics - Rule of thumb
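A short R sketch comparing two regression models with AIC and BIC and computing the relative likelihood described above; the data and model formulas are illustrative:
# Compare two models via AIC/BIC and compute the relative likelihood (illustrative data)
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- 3 + 2 * df$x1 + rnorm(50)
m1 <- lm(y ~ x1, data = df)            # smaller model
m2 <- lm(y ~ x1 + x2 + x3, data = df)  # larger model
AIC(m1); AIC(m2)
BIC(m1); BIC(m2)
# Relative likelihood that the higher-AIC model is as good as the lower-AIC one
exp((min(AIC(m1), AIC(m2)) - max(AIC(m1), AIC(m2))) / 2)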
Baseball Example: Determine average number of runs a home run is worth.
Equation:
$$ \text{Runs Scored} = a_0 + a_1[\text{Number of HR}] + a_2[\text{Number of Triples}] + \ldots + a_m[\text{Other Predictors}] $$
Applications of LR:
Causation: One thing causes another thing.
Correlation: Two things tend to happen or not happen together; neither of them might cause the other.
Application in Linear Regression:
How to decide if there is causation?
Method 1: We can adjust the data so the fit is linear.
Method 2: We can transform the response
Transform response with logarithmic function $$ \log(y) = a_0 + a_1x_1 + a_2x_2 + \ldots + a_mx_m $$
Box-Cox transformations (link) can be automated in statistical software.
Example: Variable interaction
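A quick R sketch of fitting an interaction term with lm; the variable names and data are illustrative:
# Interaction between two predictors: y ~ x1 * x2 expands to x1 + x2 + x1:x2
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 1 + 2 * df$x1 + 3 * df$x2 + 4 * df$x1 * df$x2 + rnorm(100)
fit <- lm(y ~ x1 * x2, data = df)
summary(fit)$coefficients   # includes a row for the x1:x2 interaction term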
Warnings About P-Value
(A bit more explanation on contextualizing the coefficient of 1.0 in this context, as explained on Piazza):
Notice that in that lecture (M5L6, timestamp ~3:10) he's talking about the value of the coefficient relative to the scale of the attribute in question and the response value. A coefficient of 1.0 on the age attribute isn't that significant because the coefficient is in units of USD/year of age, and the response is a household income. We can make a decent guess what values the age variable will take (probably adult, probably younger than retirement age), and those values multiplied by 1.0 (on the order of tens of dollars) aren't going to make much difference in a US household income (on the order of tens of thousands of dollars). That phrase "the coefficient multiplied by the attribute value" is important.
So this isn't a rule of thumb about the value 1.0. This is advice to keep the scale of your attributes and response in mind when interpreting coefficients.
Adjusted R-squared: Adjusts for the number of attributes used.
Warnings about R-squared:
Key Assumptions of Linear Regression
How to calculate R-squared values directly from cross validation
# R-squared = 1 - SSEresiduals/SSEtotal
# total sum of squared differences between data and its mean
SStat <- sum((dat$Crime - mean(dat$Crime))^2)
# for model, model2, and cross-validation, calculate SSres
SSres_model <- sum(model$residuals^2) #model 1
SSres_model2 <- sum(model2$residuals^2)
SSres_c <- attr(c, "ms")*nrow(dat) # MSE, times number of data points, gives sum of squared error
# Calculate R-squared
1 - SSres_model/SStat # initial model with insignificant predictors
# 0.803
1 - SSres_model2/SStat # model2 without insignificant predictors (based on p-value)
# 0.766
1 - SSres_c/SStat # cross-validated
# 0.638
# This shows that including the insignificant factors overfits
# compared to removing them, and even the fitted model is probably overfitted.
# This is not surprising since we only have 47 data points and 15 predictors (about a 3:1 ratio).
# Good to have 10:1 or more.
Q0.1 - Using the crime dataset, apply PCA and then create a regression model using the first few principal components.
prcomp for PCA (scale the data first with scale. = TRUE).
Don't forget to unscale the coefficients to make a prediction for the new city (i.e., do the scaling calculation in reverse).
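A sketch of the Q0.1 workflow, assuming the crime data is loaded as the data frame dat used in the code above, with Crime as the last column (an assumption):
# PCA on the scaled predictors, then regression on the first few principal components
pca <- prcomp(dat[, -ncol(dat)], scale. = TRUE)   # assumes Crime is the last column
summary(pca)                                      # proportion of variance explained per PC
k <- 4                                            # illustrative number of PCs to keep
pc_dat <- data.frame(pca$x[, 1:k], Crime = dat$Crime)
pc_model <- lm(Crime ~ ., data = pc_dat)
summary(pc_model)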
Eigenvalues Correspond to Principal Components
prcomp does all of this under the hood.
How to decide which PC is important: prcomp will rank them for us.
Why: Normality assumption
Box-Cox Transformation
Box-Cox Formula:
NOTE: First check whether you need the transformation (e.g. QQ plot)
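A minimal R sketch using boxcox from the MASS package to pick the transformation parameter $\lambda$; the model and data are illustrative:
# Box-Cox: find the lambda that makes the response closest to normal (assumes MASS is installed)
library(MASS)
set.seed(1)
df <- data.frame(x = runif(100, 1, 10))
df$y <- exp(0.3 * df$x + rnorm(100, sd = 0.2))   # skewed response, for illustration
bc <- boxcox(lm(y ~ x, data = df))               # plots the log-likelihood over a range of lambda
lambda <- bc$x[which.max(bc$y)]                  # lambda with the highest likelihood
df$y_bc <- (df$y^lambda - 1) / lambda            # apply the transformation (use log(y) if lambda = 0)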
Why: Time series data will often have a trend (e.g. inflation, seasons) which may bias the model. Need to adjust for these trends to run a correct analysis.
When: Whenever using a factor-based model (regression, SVM) to analyze time-series data*
("Factor-based model" uses a bunch of factors to make a prediction, non-factor based model example would be a model using predictors based on time and previous values)
On What:
How:
Why:
What:
$D_1$ becomes the new x-coordinate. $D_2$ becomes the new y-coordinate.
$X$: Initial matrix of data (scaled)
Find all of the eigenvectors of $X^TX$
PCA:
How to use beyond linear transformation
Problem: How do we get the regression coefficients for the original factors instead of PCs?
Example (Regression): PCA finds new $L$ factors ${t_{ik}}$, then regression finds coefficients $b_0, b_1, \ldots, b_L$.
$$ \begin{aligned} y_i &= b_0 + \sum^L_{k=1} b_k t_{ik} \\ &= b_0 + \sum^L_{k=1} b_k \left[ \sum^m_{j=1} x_{ij} v_{jk} \right] \\ &= b_0 + \sum^m_{j=1} x_{ij} \left[ \sum^L_{k=1} b_k v_{jk} \right] \\ &= b_0 + \sum^m_{j=1} x_{ij}\, a_j \end{aligned} $$
Implied regression coefficient for $x_j$: $$ a_j = \sum^L_{k=1} b_k v_{jk} $$
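Continuing the Q0.1 sketch above, the implied coefficients $a_j$ come from multiplying the rotation (eigenvector) matrix by the PC regression coefficients; unscaling then converts them back to the original units:
# Recover coefficients for the original factors from the PC regression (continues the Q0.1 sketch)
b <- coef(pc_model)[-1]                       # b_1, ..., b_L for the kept PCs
V <- pca$rotation[, 1:k]                      # columns are the eigenvectors v_k
a_scaled <- V %*% b                           # a_j = sum_k b_k * v_jk (in scaled units)
# Undo the scaling: divide by each factor's standard deviation, adjust the intercept by the means
a_orig <- a_scaled / pca$scale
a0 <- coef(pc_model)[1] - sum(a_scaled * pca$center / pca$scale)
a_orig; a0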
Given $A$: Square matrix
Points to consider:
Example: Classification
Takeaway: